CrossFit Open 2019 Analysis - Benchmark Workouts

Posted on June 15, 2019 in CrossFit, Python

TODO

  • [X] Switch colors
  • [X] Write function to show describe statistics per sex
  • [ ] Split into 3 posts
  • [ ] Hide code blocks?
  • [ ] Write intro, outro paragraphs and explanatory paragraphs for the main ones
In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly_express as px
from scipy import stats
import seaborn as sns
import sqlalchemy as sql

from IPython.display import display, display_html

I've collected profile data for the 1,706,699 CrossFit athletes that have ever (presumably) participated in the CrossFit Open. Having all these statistics in one place is useful as a reference for comparison, and interesting to see how Open athletes fill their profile.

It is important to remember that these statistics are all optional and self-reported. Right off the bat, the biggest problem is that people will typically not report the numbers they aren't proud of. So it's important to take this analysis with a grain of salt, and not treat it like data collected to write a scientific paper. We can still draw some very important conclusions from the analysis, keeping in mind that the numbers are probably rounded, excluded, outdated or outright false, all in order to look good. I'll do my best to include the sample size and trends in the analyses.

Goal

The reason behind this analysis is, you've guessed it, to optimize my training and work on my weaknesses. The sport of CrossFit is about being an all-round athlete, so if I see that I'm in the 15th percentile for one movement but the 85th in another, I'll want to work on the first one instead of the second. Of course, any coach worth his/her salt would be able to tell you this, but it's another thing to be able to put a number on it. Put succinctly:

To provide CrossFit athletes with a tool to see where their performance sits in regards to a small number of exercise standards and benchmarks. With this tool, athletes should be able to see:

  • What their strengths and weaknesses are compared to other CrossFit Open participants
  • Where they should focus their training for next season

Methodology

I scraped these 1.7M+ athlete profiles from the CrossFit Games profile pages over the course of several days. A more thorough look at the methodology used to scrape, parse and load that data into a database will be the subject of another article.

Variables

We find a small number of athlete stats on their profile page:

  • Athlete Stats
    • Age
    • Sex
    • Height
    • Weight
  • Weightlifting
    • Back Squat
    • Clean and Jerk
    • Snatch
    • Deadlift
  • Benchmark Workouts
    • Fight Gone Bad
    • Fran
    • Grace
    • Helen
    • Filthy 50
  • Bodyweight Exercises
    • Max Pull-ups
    • Sprint 400m
    • Run 5k

As this list can be quite extensive, we'll be splitting this post into the four sections above. Another reason for this is quite simple: the interactive visualization package I'm using to make the graphs makes for very large images, and I want to keep load times to a respectable minimum.

Inspiration

This post is largely inspired by Sam Swift's 2015 post called "What's normal (or top 5%) for a CrossFit athlete?", and a huge thank you goes to him for having created it in the first place. Be sure to check out his other posts on the sport; they're the best out there, bar none.

I won't be using his data directly nor do I have the time to make comparisons across the four years separating the posts, but it'd certainly be interesting to see how the sport has evolved with time.

Pre-processing

Even though a lot of the pre-processing was done in the steps leading to this analysis (see upcoming post on scraping methodology), we still have to deal with assigning sexes to participants, and with the all-important question of how to deal with missing values.

Pull the data

In this case, our data was saved as a table in a PostgreSQL database running locally.

In [2]:
URI_DB = "postgresql://leblancfg@localhost:5432/cf_analysis"
db = sql.create_engine(URI_DB)

df = pd.read_sql("cf_athletes", db, index_col="id")
df.sample(5)
Out[2]:
name country division age height weight affiliate fran helen grace filthy50 fgonebad run400 run5k candj snatch deadlift backsq pullups modified_date
id
565539 Monica Almada None None 25 NaN NaN None NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.553045e+09
1070889 Dean Valenzuela United States Men (45-49) 45 5.83 177.03 CGS CrossFit NaN NaN NaN NaN NaN NaN NaN 191.8 154.32 440.92 294.98 NaN 1.553176e+09
1263686 Juanita Florez None None 0 NaN NaN None NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.553218e+09
1430901 Christina Luger None None 0 NaN NaN None NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.553254e+09
1136381 Joshua St John None None 43 NaN NaN None NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.553191e+09

As we can see, a lot of the athletes only fill in the basics, such as their name and age; they either don't bother entering their stats, or don't know them well enough.
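One way to put a number on how sparsely the profiles are filled in is to compute, for each column, the share of rows that actually hold a value. The snippet below is a minimal sketch on a handful of made-up rows (the values and the `fill_rates` helper are illustrative, not taken from the database):

```python
# Toy rows standing in for the real table; None marks a field left blank.
rows = [
    {"name": "A", "age": 25, "weight": None, "fran": None},
    {"name": "B", "age": 45, "weight": 177.0, "fran": None},
    {"name": "C", "age": 43, "weight": None, "fran": None},
    {"name": "D", "age": 31, "weight": 150.0, "fran": 240.0},
]


def fill_rates(rows):
    """Return {column: fraction of rows where the column is not None}."""
    columns = rows[0].keys()
    return {
        col: sum(row[col] is not None for row in rows) / len(rows)
        for col in columns
    }


rates = fill_rates(rows)
for col, rate in rates.items():
    print(f"{col:>8}: {rate:.0%}")
```

On the real DataFrame, the same idea should boil down to `df.notna().mean()`.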

Sex

First off, we know already that a sizeable fraction of the athletes disclose sex in their profile, in the form of the Division column.

In [40]:
F_STRINGS = ["women", "girls"]
COLOUR_SCHEME = px.colors.colorbrewer.Set1


def describe(df, col):
    """Takes a DataFrame and column name, prints a groupby-describe"""
    display(df.groupby("sex")[col].describe().round(1))


def parse_division(text):
    """Given a division title, e.g. 'Men', returns sex as 'F', 'M' or None"""
    if text is None:
        return None
    if any(word in text.lower() for word in F_STRINGS):
        return "F"
    return "M"


df["sex"] = df["division"].apply(parse_division)
missing_sex = df["sex"].isna().sum()

print(
    f"{round(100 * missing_sex / len(df))}% of athletes did not submit a division, and so have no sex on record"
)
79% of athletes did not submit a division, and so have no sex on record
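To make the parsing rule concrete, here is a small self-contained sketch that runs the same function on a few division titles. Only "Men (45-49)" appears in the sample above; the other titles are assumed examples of what the column may contain.

```python
F_STRINGS = ["women", "girls"]


def parse_division(text):
    """Given a division title, e.g. 'Men', returns sex as 'F', 'M' or None"""
    if text is None:
        return None
    # The female keywords must be tested first: "men" is a substring of "women".
    if any(word in text.lower() for word in F_STRINGS):
        return "F"
    return "M"


# "Women (35-39)" and "Girls (14-15)" are assumed titles, for illustration only.
examples = ["Men (45-49)", "Women (35-39)", "Girls (14-15)", None]
results = [parse_division(division) for division in examples]
print(results)
```

Note that anything that is not recognised as a female division falls through to "M", so an unexpected division title would silently be counted as male.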

Country

In [4]:
# countries = 
# px.scatter_geo(df, 'country', locations="iso_alpha", projection="natural earth")

Body Measurements

Let's make a new DataFrame containing the data of the participants who've included weight, height and age stats. This will whittle down the number of usable profiles by quite a bit, but this way we should be able to get decent quality in our numbers.

We also see a not insignificant number of bogus entries, with participants whose declared age is 125, weight is 20 lb, or height is over 10 feet. So we take care of that in one fell swoop.

In [5]:
# Make new DataFrame called `bm` with just the ones with body measurements
bm = df[~df["sex"].isna()].query("80 < weight < 400 & 4 < height < 7 & 13 < age < 81")

print(
    f"Body measurement statistics available for {round(100 * len(bm) / len(df))}%"
    f" of athletes out of a possible {len(df):,}."
)
Body measurement statistics available for 11% of athletes out of a possible 1,706,699.
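For readers less familiar with `DataFrame.query`, the bounds used above can be sketched in plain Python. This is an illustration only: the `is_plausible` helper and the rows are made up (apart from the first one, which reuses values from the sample), and the check on sex is left out.

```python
def is_plausible(row):
    """Mirror the query bounds: all three fields present and within range."""
    weight, height, age = row.get("weight"), row.get("height"), row.get("age")
    if weight is None or height is None or age is None:
        return False
    return 80 < weight < 400 and 4 < height < 7 and 13 < age < 81


rows = [
    {"age": 45, "height": 5.83, "weight": 177.03},  # plausible profile
    {"age": 125, "height": 5.5, "weight": 150.0},   # bogus age
    {"age": 30, "height": 10.2, "weight": 180.0},   # bogus height
    {"age": 30, "height": 5.9, "weight": 20.0},     # bogus weight
    {"age": 25, "height": None, "weight": None},    # nothing entered
]

kept = [row for row in rows if is_plausible(row)]
print(f"{len(kept)} of {len(rows)} rows kept")
```

All bounds are strict, so a weight of exactly 80 lb or an age of exactly 81 is dropped as well.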

Age

In [6]:
px.histogram(
    bm,
    "age",
    nbins=128,
    color="sex",
    barmode="group",
    title="CrossFit Open 2019 Athlete Pages — Age Distribution % by Sex",
    marginal="box",
    color_discrete_sequence=COLOUR_SCHEME,
)
In [41]:
describe(bm, "age")
count mean std min 25% 50% 75% max
sex
F 58233.0 34.5 9.5 14.0 28.0 33.0 40.0 77.0
M 121472.0 35.4 9.3 14.0 29.0 34.0 41.0 80.0

Height

In [7]:
height = bm.query("4.5 < height < 7")
px.histogram(
    height,
    "height",
    nbins=32,
    color="sex",
    barmode="group",
    title="CrossFit Open 2019 Athlete Pages — Height Distribution % by Sex (in decimal feet)",
    marginal="box",
    color_discrete_sequence=COLOUR_SCHEME,
)
In [42]:
describe(bm, "height")
count mean std min 25% 50% 75% max
sex
F 58233.0 5.4 0.2 4.1 5.2 5.4 5.6 6.8
M 121472.0 5.9 0.2 4.0 5.7 5.9 6.0 6.9

These distributions are very, very interesting. Obviously there's the gigantic notch at 6' for the men and 5'6" for the women — but we can probably ascribe that to the fact that it's self-reported data.

But that's OK! It looks like 5'10" is still the self-reported average for the men, and 5'5" for the women.
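Since heights are stored in decimal feet while we tend to talk in feet and inches, a small conversion helper makes the two easier to compare. A minimal sketch (the `feet_to_ft_in` name is mine, and it rounds to the nearest inch):

```python
def feet_to_ft_in(decimal_feet):
    """Convert decimal feet (e.g. 5.83) to a (feet, inches) tuple."""
    total_inches = round(decimal_feet * 12)
    # Working in whole inches first avoids results like 5 ft 12 in.
    return divmod(total_inches, 12)


for value in (5.5, 5.83, 6.0):
    feet, inches = feet_to_ft_in(value)
    print(f"{value} ft is {feet} ft {inches} in")
```

So 5.5 in the data corresponds to 5'6", and 6.0 to an even 6'.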

Weight

In [8]:
px.histogram(
    bm,
    "weight",
    nbins=128,
    color="sex",
    barmode="group",
    title="CrossFit Open 2019 Athlete Pages — Weight Distribution % by Sex",
    marginal="box",
    color_discrete_sequence=COLOUR_SCHEME,
)
In [38]:
describe(bm, "weight")
count mean std min 25% 50% 75% max
sex
F 58233.0 142.0 21.8 80.0 127.9 140.0 152.1 381.4
M 121472.0 187.6 26.6 80.0 170.0 185.0 201.9 399.9

Now when it comes to weight, though, the situation is a bit different.

For the men, you could almost say that the distribution is bimodal: one of the peaks is around 175, and there's another, much sharper peak at 185. The situation is reversed for the women, where the peak is around 130.

It's hard to say whether that's caused by athletes trying to maintain a certain weight, wishful thinking, or rounding to the nearest multiple of 5; in reality, it's probably some mix of those hypotheses.
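One rough way to probe the rounding hypothesis is to measure how many reported weights land on, or very near, a multiple of 5 lb. The sketch below uses a made-up list of weights and a hypothetical `share_near_multiple` helper; it is not something I ran on the full dataset.

```python
def share_near_multiple(values, step=5, tol=0.1):
    """Fraction of values lying within `tol` of a multiple of `step`."""

    def is_near(value):
        remainder = value % step
        return min(remainder, step - remainder) <= tol

    return sum(is_near(v) for v in values) / len(values)


# Illustrative weights in lb, not real data.
weights = [185.0, 175.0, 177.03, 190.0, 201.9, 170.0, 183.4, 185.0]
share = share_near_multiple(weights)
print(f"{share:.1%} of weights sit on a multiple of 5 lb")
```

If weights were spread evenly, only about 2 * tol / step = 4% of them would fall inside that window, so a much higher share would point towards rounding.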

Height / Mass Ratio

In [9]:
px.density_contour(
    bm,
    x="height",
    y="weight",
    marginal_x="box",
    marginal_y="box",
    color="sex",
    trendline="lowess",
    labels={"height": "Height (feet)", "weight": "Weight (lb)"},
    title="CrossFit Open 2019 Athlete Pages — Distribution of Height and Weight by Sex",
    color_discrete_sequence=COLOUR_SCHEME,
)

I wish I could have made the density contours a little less sharp, because we're seeing clumps around every inch. But still, at least we can see the overall shape, and even a nicely fit trendline showing the trend weight for a given height.

We also find a large number of outliers in the weight distributions, but most of them are located near the top — safe to assume we're seeing traces of America's obesity problem right there.

Exercises

Deadlift

In [10]:
dl = bm.query("0 < deadlift < 800")
px.histogram(
    dl,
    "deadlift",
    nbins=64,
    color="sex",
    barmode="group",
    title="CrossFit Open 2019 Athlete Pages — Distribution % of Max Deadlift by Sex",
    marginal="box",
    color_discrete_sequence=COLOUR_SCHEME,
)
In [43]:
describe(dl, "deadlift")
count mean std min 25% 50% 75% max
sex
F 23181.0 246.5 54.3 1.1 210.1 244.9 285.1 694.5
M 58144.0 391.2 78.5 1.1 341.7 396.8 440.9 740.1

From anecdotal evidence, DAAAANG people are strong! And a lot of people are submitting these, too, so it sure seems as though athletes can regularly crank these weights. I guess that means back to the barbell for this man... ha!

Back Squat

In [11]:
bs = bm.query("35 < backsq < 600")
px.histogram(
    bs,
    "backsq",
    nbins=64,
    color="sex",
    barmode="group",
    title="CrossFit Open 2019 Athlete Pages — Distribution % of Max Back Squat by Sex",
    marginal="box",
    color_discrete_sequence=COLOUR_SCHEME,
)
In [46]:
describe(bs, "backsq")
count mean std min 25% 50% 75% max
sex
F 22489.0 199.6 48.9 35.0 165.4 200.0 229.9 524.9
M 56672.0 319.4 71.7 35.3 274.9 315.0 365.1 595.2

Clean & Jerk

In [32]:
candj = bm.query("20 < candj < 400")
px.histogram(
    candj,
    "candj",
    nbins=64,
    color="sex",
    barmode="group",
    title="CrossFit Open 2019 Athlete Pages — Distribution % of Max Clean & Jerk by Sex",
    marginal="box",
    color_discrete_sequence=COLOUR_SCHEME,
)
In [47]:
describe(candj, "candj")
count mean std min 25% 50% 75% max
sex
F 20345.0 139.2 35.1 22.9 115.1 136.7 164.9 352.7
M 51829.0 224.2 49.8 20.1 187.4 225.1 255.1 399.9

Snatch

In [48]:
snatch = bm.query("5 < snatch < 350")
px.histogram(
    snatch,
    "snatch",
    nbins=64,
    color="sex",
    barmode="group",
    title="CrossFit Open 2019 Athlete Pages — Distribution % of Max Snatch by Sex",
    marginal="box",
    color_discrete_sequence=COLOUR_SCHEME,
)
In [49]:
describe(snatch, "snatch")
count mean std min 25% 50% 75% max
sex
F 19483.0 106.2 30.2 6.6 85.1 104.9 125.0 330.7
M 50245.0 172.9 43.1 5.1 143.3 172.0 200.6 343.9

Bodyweight Exercises

Pull-ups

In [50]:
pu = bm.query("0 < pullups < 101")
px.histogram(
    pu,
    "pullups",
    nbins=64,
    color="sex",
    barmode="group",
    title="CrossFit Open 2019 Athlete Pages — Distribution % of Pullups by Sex",
    marginal="box",
    color_discrete_sequence=COLOUR_SCHEME,
)
In [51]:
describe(pu, "pullups")
count mean std min 25% 50% 75% max
sex
F 6757.0 18.2 12.8 1.0 8.0 15.0 25.0 100.0
M 23238.0 31.1 15.9 1.0 20.0 30.0 41.0 100.0

Run 400 m

In [52]:
run_400 = bm.query("44 < run400 < 200")
px.histogram(
    run_400,
    "run400",
    nbins=64,
    color="sex",
    barmode="group",
    title="CrossFit Open 2019 Athlete Pages — Distribution % of 400 m Run Times by Sex",
    marginal="box",
    color_discrete_sequence=COLOUR_SCHEME,
)
In [53]:
describe(run_400, "run400")
count mean std min 25% 50% 75% max
sex
F 3120.0 90.0 20.9 45.0 76.0 87.0 100.0 195.0
M 11492.0 73.9 17.5 45.0 62.0 70.0 81.0 197.0

Run 5km

In [54]:
run_5k = bm.query("777 < run5k < 3000")
px.histogram(
    run_5k,
    "run5k",
    nbins=64,
    color="sex",
    barmode="group",
    title="CrossFit Open 2019 Athlete Pages — Distribution % of 5 km Run Times by Sex",
    marginal="box",
    color_discrete_sequence=COLOUR_SCHEME,
)
In [55]:
describe(run_5k, "run5k")
count mean std min 25% 50% 75% max
sex
F 6442.0 1596.0 280.3 808.0 1408.0 1560.0 1740.0 2979.0
M 20203.0 1395.6 240.6 804.0 1229.0 1350.0 1502.0 2965.0

A lot of outliers here, but the 777-second (12:57) cutoff at least removes entries that would rival the world record, which stood at 12:37.35 (about 757 seconds) on the track at the time of writing.

Aside

Now, my PR is 22 minutes (or 1320 seconds), which I thought was respectable. Turns out:

In [56]:
men_5k = run_5k[run_5k["sex"] == "M"]["run5k"]
print(
    f"A 22 minute 5 km run for a man is in the {round(stats.percentileofscore(men_5k, 22 * 60), 1)}th percentile"
)
A 22 minute 5 km run for a man is in the 43.7th percentile

So... it's faster than both the mean and the median, but not by much: for a timed event a lower percentile is better, and the distribution is heavily skewed, with the mode sitting around 21 minutes. I've got some catching up to do!
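For anyone wondering what a percentile of a score means for a timed event, here is a minimal pure-Python sketch of the idea. It uses the "less than or equal" definition, which as far as I know corresponds to `kind="weak"` in SciPy's `percentileofscore` (the default, `kind="rank"`, handles ties slightly differently). The times are made up.

```python
def percentile_of_score(values, score):
    """Percentage of values less than or equal to `score`."""
    return 100 * sum(v <= score for v in values) / len(values)


# Illustrative 5 km times in seconds, not real data.
times = [1100, 1200, 1260, 1320, 1350, 1400, 1500, 1600, 1800, 2100]
pct = percentile_of_score(times, 22 * 60)
print(f"{pct:.1f}% of these athletes ran 22:00 or faster")
```

Because lower times are better, the share of athletes who are slower is simply 100 minus that number.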

Named Workouts

Fran

Fran is one of the most well-known of CrossFit workouts, as it's a simple but deadly couplet of:

21-15-9 reps of:

  • thrusters (95 / 65 lb) and
  • pull-ups, for time.
In [18]:
fran = bm.query("45 < fran < 800")

px.histogram(
    fran,
    "fran",
    nbins=64,
    color="sex",
    barmode="group",
    title="CrossFit Open 2019 Athlete Pages — Distribution % of Fran Workout Times, by Sex",
    marginal="box",
    color_discrete_sequence=COLOUR_SCHEME,
)
In [57]:
describe(fran, "fran")
count mean std min 25% 50% 75% max
sex
F 8540.0 343.7 127.1 60.0 246.0 328.0 423.0 795.0
M 28985.0 289.6 124.2 50.0 192.0 262.0 358.0 798.0

We first see a very big difference in overall numbers between men and women: tracking those numbers, or at least entering them, seems to be much more popular with the former.

For the men, we see a peak for the 175-179 second bin, where we see athletes pushing to get their times under 3 minutes. A lofty goal indeed! There are similar but less pronounced peaks at and before round numbers.

For the women, that elongated peak shape is more evenly distributed around five minutes, with many athletes under that all-important 5 minute mark.
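To show how a peak such as the 175-179 second bin can be located, here is a small sketch that groups times into fixed-width bins and returns the fullest one. The 5-second width and the times are assumptions for illustration; the histogram above leaves the exact bin edges to the plotting library.

```python
from collections import Counter


def modal_bin(times, width=5):
    """Return (bin_start, bin_end_inclusive, count) for the fullest bin."""
    counts = Counter(int(t // width) * width for t in times)
    start, count = counts.most_common(1)[0]
    return start, start + width - 1, count


# Illustrative Fran times in seconds, not real data.
times = [176, 178, 179, 175, 181, 195, 240, 178, 300, 177]
start, end, count = modal_bin(times)
print(f"Fullest bin: {start}-{end} s with {count} athletes")
```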

Grace

Another classic, Grace is a deceptively simple benchmark, consisting of:

30 clean-and-jerks of 135/95 lbs, for time.

Nothing more, nothing less.

In [19]:
grace = bm.query("45 < grace < 600")

px.histogram(
    grace,
    "grace",
    nbins=64,
    color="sex",
    barmode="group",
    title="CrossFit Open 2019 Athlete Pages — Distribution % of Grace Workout Times, by Sex",
    marginal="box",
    color_discrete_sequence=COLOUR_SCHEME,
)
In [58]:
describe(grace, "grace")
count mean std min 25% 50% 75% max
sex
F 7318.0 222.5 90.0 60.0 158.0 204.0 268.0 598.0
M 22079.0 198.9 85.9 46.0 138.0 178.0 238.0 599.0

Similar to Fran, we find skewed distributions with peaks at and before the minute marks.

For men, a good first milestone would be the three-minute mark, with a lump of elite athletes peaking under two minutes.

For the women, that ridge in the under-three-but-over-two minute numbers is smooshed out further, and the median sits at 3 m 24 s, a little above the 3-minute mark.

Helen

An all-time favorite of mine, Helen consists of:

3 rounds for time:

  • 400 m run
  • 21 American kettlebell swings (55/36 lb)
  • 12 pull-ups
In [20]:
helen = bm.query("250 < helen < 1200")

px.histogram(
    helen,
    "helen",
    nbins=64,
    color="sex",
    barmode="group",
    title="CrossFit Open 2019 Athlete Pages — Distribution % of Helen Workout Times, by Sex",
    marginal="box",
    color_discrete_sequence=COLOUR_SCHEME,
)
In [59]:
describe(helen, "helen")
count mean std min 25% 50% 75% max
sex
F 5293.0 688.3 131.6 260.0 592.0 673.0 767.0 1194.0
M 16300.0 611.0 125.3 258.0 522.0 587.0 679.0 1198.0

Now, here it sure looks like there's more room to play than in the previous two! We see a log-normal distribution here, but with much wider tails compared to the other two benchmark workouts. The fact that we see three modalities (run, KBS, PU) in here indicates to me that it's harder to optimize it to a science.

For men, the mean is at 10 m 11 s, and the median is at 9 m 47 s. We see the largest peak right under 9 minutes, and another big peak right under the 8 minute mark, at 479 seconds.

For women, the mean is at 11 m 28 s, and the median at 11 m 13 s. The largest peak in this case is right under the 10 minute mark.
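The summary tables are in seconds, so reading them as minutes and seconds takes a small conversion. A minimal sketch (the `fmt_time` helper is mine), applied to the men's Helen mean and median from the table above:

```python
def fmt_time(seconds):
    """Format a duration in seconds as 'M m S s', rounded to the nearest second."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes} m {secs} s"


print(fmt_time(611.0))  # men's mean Helen time
print(fmt_time(587.0))  # men's median Helen time
```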

CrossFit Benchmarks

Filthy50

This one was new to me as I had never heard of it before, but seems dreadful in a beautiful kind of way. I'm actually excited to try it soon! The Filthy Fifty is:

For Time:

  • 50 Box Jumps (24/20 in)
  • 50 Jumping Pull-Ups
  • 50 Kettlebell Swings (1/.75 pood)
  • 50 Walking Lunges
  • 50 Knees-to-Elbows
  • 50 Push Press (45/35 lb)
  • 50 Back Extensions
  • 50 Wall Balls (20/14 lb)
  • 50 Burpees
  • 50 Double-Unders

Obviously a good test of overall athletic ability, the F50 number seems like an excellent single number to use to determine both stamina and technical ability across a wide range of exercises. Even though it does not contain any "heavy" movements, it feels like the kind of benchmark you want to focus on coming into the Open, for example.

In [21]:
filt50 = bm.query("500 < filthy50 < 3000")
px.histogram(
    filt50,
    "filthy50",
    nbins=128,
    color="sex",
    barmode="group",
    title="CrossFit Open 2019 Athlete Pages — Distribution % of Filthy Fifty Workout Times, by Sex",
    marginal="box",
    color_discrete_sequence=COLOUR_SCHEME,
)
In [60]:
describe(filt50, "filthy50")
count mean std min 25% 50% 75% max
sex
F 3006.0 1656.5 351.7 561.0 1407.0 1623.5 1859.0 2993.0
M 8770.0 1572.1 365.8 501.0 1307.0 1531.0 1785.0 2995.0

For once, we see similar time numbers for both men and women, at least compared to the other workouts. Again though, we see higher "peakedness" for the men, with the mode right under the 25-minute mark. For the men, finishing under 22 minutes puts you in the fastest 25%; for the women, that takes a time under 23 minutes 30 seconds.

Fight Gone Bad

In what seems to be one of the hardest "old-school" benchmarks, lastly comes the Fight Gone Bad. A sprint-your-a$s-off-but-pace-yourself workout combining a shoulder-heavy mix of exercises that takes lots of strategy, planning, and mental fortitude, Fight Gone Bad represents a good number for competition potential and capacity to approach workouts with high numbers in mind.

3 Rounds For Total Reps in 17 minutes

  • 1 minute Wall Balls (20/14 lb)
  • 1 minute Sumo Deadlift High-Pulls (75/55 lb)
  • 1 minute Box Jumps (20 in)
  • 1 minute Push Press (75/55 lb)
  • 1 minute Row (calories)
  • 1 minute Rest
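Unlike the other benchmarks, Fight Gone Bad is scored in total reps rather than time. The sketch below shows where the 17 minutes come from and how a score adds up; the rep counts are for a hypothetical athlete, not real data.

```python
STATIONS = [
    "wall balls",
    "sumo deadlift high-pulls",
    "box jumps",
    "push press",
    "row (calories)",
]
ROUNDS = 3
REST_MINUTES = 1

# One minute per station; the rest follows every round except the last.
total_minutes = ROUNDS * len(STATIONS) + (ROUNDS - 1) * REST_MINUTES


def fgb_score(reps_by_round):
    """Sum reps (and rowed calories) over every station of every round."""
    return sum(sum(round_reps) for round_reps in reps_by_round)


# Hypothetical athlete: one inner list per round, one number per station.
example = [
    [25, 20, 22, 18, 15],
    [22, 18, 20, 16, 14],
    [20, 17, 19, 15, 14],
]
score = fgb_score(example)
print(f"{total_minutes} minutes, {score} reps")
```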
In [22]:
fgb = bm.query("100 < fgonebad < 600")
px.histogram(
    fgb,
    "fgonebad",
    nbins=128,
    color="sex",
    barmode="group",
    title="CrossFit Open 2019 Athlete Pages — Distribution % of Fight Gone Bad Workout Scores (Total Reps), by Sex",
    marginal="box",
    color_discrete_sequence=COLOUR_SCHEME,
)
In [61]:
describe(fgb, "fgonebad")
count mean std min 25% 50% 75% max
sex
F 4753.0 278.8 58.7 103.0 237.0 276.0 315.0 595.0
M 13309.0 311.8 63.1 101.0 269.0 308.0 349.0 593.0

We see here large peaks at 250, 300 and 400, and apparently some participants cranking numbers past 500. An interesting metric in itself, and I suspect I'll use this benchmark in a later blog post focusing on predictors of high placements in the Open.